Queues: benchmark implementations against each other #2276

Open: wants to merge 19 commits into main

Conversation

@polybeandip (Collaborator) commented Sep 3, 2024

Makes progress towards #2221

Changes:

  • implement strict and round_robin scheduling transactions via stable_binheap (a rough model of the idea follows this list)
  • use generate_name in ComponentBuilder.case; this way, there are fewer conflicts when a component uses multiple cases
    • change the associated case test
  • tweak gen_test_data.sh to copy round_robin and strict data files into tests/binheap/
  • make clean_test_data.sh purge data files in tests/binheap/round_robin and tests/binheap/strict
  • make runt.toml test our new queues
  • set up cycles.sh to generate cycle counts for all tests
  • set up resources.sh to generate synthesis results for all tests
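
Since the first bullet leans on the rank-plus-stable-heap trick, here is a rough Python model of round-robin on top of a stable min-heap. This is only an illustration of the general idea, not the calyx-py/Calyx implementation in this PR; every name in it is made up, and heapq plus an insertion counter stands in for stable_binheap.

```python
import heapq
from itertools import count

class RoundRobinModel:
    """Toy model: each element gets a rank, the heap pops the smallest rank,
    and an insertion counter keeps pops within a flow in FIFO order (the job
    the stable heap does in hardware)."""

    def __init__(self, num_flows):
        self.heap = []
        self.flow_rank = [0] * num_flows  # rank last handed to each flow
        self.served_rank = 0              # rank of the most recently popped element
        self.seq = count()                # tie-breaker standing in for heap stability

    def push(self, flow, value):
        # Schedule the element one "turn" after whichever is later: the flow's
        # previous element or the element most recently served.
        rank = max(self.flow_rank[flow], self.served_rank) + 1
        self.flow_rank[flow] = rank
        heapq.heappush(self.heap, (rank, next(self.seq), value))

    def pop(self):
        rank, _, value = heapq.heappop(self.heap)
        self.served_rank = rank
        return value
```

For example, pushing a, b from flow 0 and c, d from flow 1 and then popping four times yields a, c, b, d.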

- Setup flit install
- Tweak action to install queues pkg
- Run test suite with runt
- alter CI to run gen script before runt tests
- keep tests/binheap/binheap_test.{data, expect} (they're small)
- separate pkg install and data gen into different jobs
- fix SDN tests
- make binheap tests run
- also tweak calyx-py case test
@anshumanmohan (Contributor)

Hi @polybeandip, thanks for getting this going! Since you are knee-deep in this stuff, I would like to tack on a quick request that should be easy to handle. @parthsarkar17 is interested in adding some queue-ey stuff to his benchmarking suite that optimizes FSMs and such. Could you please pass him designs for one or two of our more complicated tree schedulers? For each design, he'll need a .futil file, a .data file, and a .expect file.

Some candidates that come to mind:

  • Cassandra did up a "complex tree" that uses only rr and strict but has a height of 2 or 3.
  • Some of your newest implementations of schedulers using stable binary heaps.

- setup script for generating cycle counts: cycles.sh
- report cycle counts in cycles.txt
- tweak clean_test_data.sh to remove binheap/round_robin and binheap/strict tests
@polybeandip (Collaborator, Author) commented Sep 11, 2024

UPDATE on our benchmarks

cycles.sh generates cycle counts for all our designs. Remember to run gen_test_data.sh first!

For now, I've also reported this data in cycles.txt and below:

binheap/binheap_test.py: 750
binheap/fifo_test.py: 1509164
binheap/pifo_test.py: 1784719
binheap/round_robin/rr_2flow_test.py: 1870740
binheap/round_robin/rr_3flow_test.py: 1884450
binheap/round_robin/rr_4flow_test.py: 1897807
binheap/round_robin/rr_5flow_test.py: 1903303
binheap/round_robin/rr_6flow_test.py: 1934811
binheap/round_robin/rr_7flow_test.py: 1944544
binheap/stable_binheap_test.py: 1802173
binheap/strict/strict_2flow_test.py: 1785391
binheap/strict/strict_3flow_test.py: 1826179
binheap/strict/strict_4flow_test.py: 1842823
binheap/strict/strict_5flow_test.py: 1852314
binheap/strict/strict_6flow_test.py: 1851588
complex_tree_test.py: 1504412
fifo_test.py: 595422
pifo_tree_test.py: 1199525
round_robin/rr_2flow_test.py: 993313
round_robin/rr_3flow_test.py: 1013489
round_robin/rr_4flow_test.py: 1032869
round_robin/rr_5flow_test.py: 1051211
round_robin/rr_6flow_test.py: 1125919
round_robin/rr_7flow_test.py: 1136933
sdn_test.py: 1244054
strict/strict_2flow_test.py: 1058077
strict/strict_3flow_test.py: 1144637
strict/strict_4flow_test.py: 1233029
strict/strict_5flow_test.py: 1315433
strict/strict_6flow_test.py: 1481437

It looks like @csziklai's PIFOs do about 300,000 to 1,000,000 cycles better than the binary heaps!
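
As a sanity check on that range, here is a quick script over the numbers above, pairing a few binheap-based tests with their specialized counterparts (only representative pairs shown):

```python
# Cycle counts copied from the table above: (binheap version, specialized version)
pairs = {
    "fifo":         (1_509_164, 595_422),
    "rr_2flow":     (1_870_740, 993_313),
    "rr_7flow":     (1_944_544, 1_136_933),
    "strict_2flow": (1_785_391, 1_058_077),
    "strict_6flow": (1_851_588, 1_481_437),
}
for name, (binheap, other) in pairs.items():
    print(f"{name}: binheap is {binheap - other:,} cycles slower")
```

The gaps come out between roughly 370,000 (strict_6flow) and 914,000 (fifo) cycles.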

@anshumanmohan (Contributor)

!! Please tell the channel!

@rachitnigam (Contributor)

It looks like @csziklai's PIFOs do about 300,000 to 1,000,000 cycles better than the binary heaps!

This sounds awesome, but I would caution against celebrating the numbers too much without synthesis results. Something that takes 1,000,000 cycles at 100 MHz is still way faster than something that takes 100,000 cycles at 1 MHz.
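
Spelling out the arithmetic behind that example (total time = cycles / clock frequency):

```python
print(1_000_000 / 100e6)  # 0.01 s, i.e. 10 ms at 100 MHz
print(100_000 / 1e6)      # 0.10 s, i.e. 100 ms at 1 MHz
```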

@polybeandip (Collaborator, Author)

Agreed! Synthesis results are coming up shortly

@polybeandip (Collaborator, Author) commented Sep 12, 2024

UPDATE on our synthesis results

resources.sh generates "executive summaries" of the synthesis reports for all our designs: i.e. the output of

fud e --to resource-estimate <filename>.futil

for all our queues. Notably, this means our results (resources.txt) are missing certain stats (for example, BRAMs) and use the default .xdc file: i.e. a clock period of 7 ns. Per @sampsyo's suggestion, this frequency should be okay for now.
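
For reference, a minimal sketch of the kind of loop a resources.sh-style driver could run; the file layout and output handling here are assumptions for illustration, and only the fud invocation is taken from this PR:

```python
import glob
import subprocess

with open("resources.txt", "w") as out:
    # Assumed location of the queue designs; adjust to the repo's actual layout.
    for futil in sorted(glob.glob("tests/**/*.futil", recursive=True)):
        out.write(f"=== {futil} ===\n")
        report = subprocess.run(
            ["fud", "e", "--to", "resource-estimate", futil],
            capture_output=True, text=True,
        )
        out.write(report.stdout)
```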

@anshumanmohan I think it would be odd to commit cycles.txt and resources.txt; I've only added them to show folks reviewing this PR the results. As for where they should live, I'm thinking we make a discussion thread in the packet scheduling repo and park it there. Perhaps that thread could be a running log of synthesis results and cycle counts as we continue tweaking our designs?

@polybeandip marked this pull request as ready for review September 12, 2024 20:45
@polybeandip (Collaborator, Author) commented Sep 13, 2024

Pretty pictures!

Attached plots: cycles_round_robin.png, cycles_strict.png, lut_round_robin.png, lut_strict.png, registers_round_robin.png, registers_strict.png, worst_slack_round_robin.png, worst_slack_strict.png, muxes_round_robin.png, muxes_strict.png

@rachitnigam (Contributor) commented Sep 13, 2024

Notably, this means our results (resources.txt) are missing certain stats (for example, BRAMs) and use the default .xdc file: i.e. a clock period of 7 ns. Per @sampsyo's suggestion, this frequency should be okay for now.

Thanks for running synthesis @polybeandip! The frequency number you want to report is determined by the worst_slack value. You can calculate the best achievable frequency as best_freq = 1000/(7 - worst_slack) MHz, with worst_slack in ns. In general, you want to compare the total execution time, cycle_count / best_freq, to determine which design is better.
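
A quick sketch of that calculation in Python; the cycle counts are the ones reported earlier in this thread, but the worst_slack values below are made-up placeholders, not results from this PR:

```python
designs = {
    "binheap/round_robin/rr_2flow": {"cycles": 1_870_740, "worst_slack": 2.0},  # slack assumed
    "round_robin/rr_2flow":         {"cycles": 993_313,   "worst_slack": 0.5},  # slack assumed
}

for name, d in designs.items():
    best_freq_mhz = 1000 / (7 - d["worst_slack"])        # 7 ns default clock period
    total_time_ms = d["cycles"] / (best_freq_mhz * 1e3)  # MHz = 1e3 cycles per ms
    print(f"{name}: {best_freq_mhz:.1f} MHz, {total_time_ms:.2f} ms")
```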

@polybeandip (Collaborator, Author)

Ah, perfect! I'll crunch those numbers soon

@anshumanmohan (Contributor)

As for where they should live, I'm thinking we make a discussion thread in the packet scheduling repo and park it there. Perhaps that thread could be a running log of synthesis results and cycle counts as we continue tweaking our designs?

Agreed, thank you!

@polybeandip (Collaborator, Author)

More graphs, this time using @rachitnigam's suggestion to compute the total time spent on our workload. This way, we account for differences in clock speed.

Attached plots: muxes_round_robin.png, muxes_strict.png

@anshumanmohan (Contributor)

@polybeandip so @ayakayorihiro has the same request as Parth! Would you mind sending her the same things?
